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Abstract — Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic modeling paradigm, and recently finds many applications 
in computer vision and computational biology. This paper proposes a fast and accurate algorithm, active belief propagation (ABP), for 
training LDA. Usually training LDA requires repeated scanning of the entire corpus and searching the complete topic space. Confronted 
with massive corpus with large number of topics, such a training iteration is often inefficient and time-consuming. To accelerate the 
training speed, ABP actively scans partial corpus and searches partial topic space for topic modeling, saving enormous training time 
in each iteration. To ensure accuracy, ABP selects only those documents and topics that contribute to the largest residuals within the 
residual belief propagation (RBP) framework. On four real-world corpora, ABP performs around 10 to 100 times faster than some of the 
major state-of-the-art algorithms for training LDA, while retains a comparable topic modeling accuracy. 

Index Terms — Latent Dirichlet allocation, topic models, residual belief propagation, active belief propagation, message passing, factor 
graph, Markov random fields, Gibbs sampling, variational Bayes. 



o 

o 



> 
o 



o 

(N 



X 



1 Introduction 

The probabilistic topic modeling techniques divide the 
non-zero elements in the document-word matrix into 
several thematic groups called topics, which have found 
many real-world applications in text mining, computer 
vision, and computational biology [1]. As the simplest 
topic model, latent Dirichlet allocation (LDA) |2J repre- 
sents each document as a mixture of topics and each 
topic as a multinomial distribution over the fixed vo- 
cabulary. From the labeling point of view, LDA assigns 
the hidden topic labels to explain the observed words in 
document- word matrix [3\. This labeling process defines 
a joint probability distribution over the hidden labels 
and the observed words. Employing the Bayes' rule, we 
can infer topic labels from observed words by computing 
the posterior distribution of the hidden variables given 
the observed variables from their joint probability. Such 
inference techniques have been widely-used for learn- 
ing probabilistic graphical models within the Bayesian 
framework (4). Recent approximate inference methods 
for LDA |5J fall broadly into three categories: variational 
Bayes (VB) [2], collapsed Gibbs sampling (GS) [6], and 
loopy belief propagation (BP) 0. 

VB is a variational message passing method [7 J that 
uses a set of factorized variational distributions to ap- 
proximate the objective joint distribution by minimizing 
the Kullback-Leibler (KL) divergence between them. Be- 
cause there is always a gap between the variational dis- 
tribution and the true joint distribution, VB introduces 
bias when training LDA. In contrast, GS approximates 
this joint distribution by the Markov Chain Monte Carlo 
(MCMC) process, which is theoretically more accurate 



/. Zeng is with the School of Computer Science and Technology, Soochow 
University, Suzhou 215006, China. To whom correspondence should be 
addressed. E-mail: j.zeng@ieee.org. 

Z.-Q. Liu and X.-Q. Cao are with the School of Creative Media, City 
University of Hong Kong, Hong Kong, China. 



but converges practically much slower than VB. Repre- 
senting LDA by a factor graph 00, 0, BP directly infers 
the marginal probability of hidden topics over observed 
words called messages from their joint probability, so that 
it is currently the most accurate algorithm for training 
LDA. Also, the message passing process of BP converges 
much faster than GS [3J. Similar BP implementation has 
also been discussed as the CVB0 algorithm within the 
mean-field framework |5), ||9)- 

However, all above algorithms for training LDA re- 
quire multiple iterations of scanning the entire corpus 
and searching the complete topic space. So, the com- 
putational cost increases linearly with the number of 
documents D, the number of topics K, and the num- 
ber of training iterations T. Confronted with massive 
corpora with large number of topics, these conventional 
algorithms are often inefficient for fast topic modeling 
because of high per-iteration cost. For example, when 
D = 10 6 , K = 10 3 and T = 500, GS consumes around 
forty hours to infer the hidden topics. Therefore, we are 
in great need of the fast topic modeling algorithms for 
real-world massive corpora. 

In this paper, we propose a fast topic modeling algo- 
rithm for training LDA: active belief propagation (ABP). 
The basic idea is to actively select partial corpus and 
search partial topic space at each training iteration. In 
theory, ABP is based on the residual BP (RBP) flTUH 
framework, which uses an informed scheduling for asyn- 
chronous message passing. First, we define the message 
residual as the absolute difference between messages 
at successive training iterations. The residuals evaluate 
the convergence speed of messages, where the larger 
residuals correspond to the faster convergence of mes- 
sages. Second, through sorting residuals at each iteration, 
we actively select those documents and topics with the 
largest residuals for message updating and passing. For 
example, if ABP selects 10% documents and 10% topics 
for message passing at each iteration, it may be 100 
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times faster than conventional BP algorithms. Similar 
strategies that select to optimize those fast-converging 
messages have been discussed in active learning with 
statistical models [11]. Extensive experiments on several 
real-world corpora have demonstrated that ABP can 
achieve a significant speedup, while can retain a similar 
topic modeling accuracy as compared with current state- 
of-the-art algorithms for training LDA. 

The proposed ABP algorithm can be also interpreted 
as a fast expectation-maximization (EM) algorithm [12J. 
At the E-step, ABP infers and updates partial messages 
with the largest residuals. At the M-step, ABP maximizes 
the likelihood by estimating parameters based on partial 
messages. After multiple iterations, it often converges at 
the local maximum of the joint probability. Because the 
residuals reflect the convergence speed of messages, ABP 
focuses all computational resources on inferring and up- 
dating those fast-converging messages at each iteration. 
In this way, it may achieve almost the same performance 
as BP. Therefore, ABP has the potential to be extended 
for learning other finite mixture models based on EM 
algorithms, such as Gaussian mixture models [4J. Similar 
principle can be also applied to developing fast k-means 
clustering algorithm [4). 

In Section|2]we review some recent fast topic modeling 
techniques, and in Section [3] we introduce the factor 
graph interpretation of LDA and the residual BP (RBP) 
algorithm for training LDA. Section[4]proposes the active 
belief propagation (ABP) algorithm based on the residual 
belief propagation (RBP) framework. Section [5] compares 
ABP with the state-of-the-art algorithms on real-world 
text corpora. Finally, we draw conclusions and envision 
future work in Section [6] 

2 Fast Topic Modeling Techniques 

Currently, online and parallel learning algorithms are 
two major fast topic modeling techniques. Online learn- 
ing algorithms for LDA [13J, [14J partition the entire 
corpus into a set of mini-batches. For each mini-batch, 
online learning algorithms converge significantly faster 
than offline algorithms, and thus save enormous training 
time by reducing the total number of iterations T. For 
example, online GS IIT31 algorithms have two strategies. 
The first, called incremental LDA, periodically resamples 
the topic assignments for previously analyzed words. 
The second approach uses particle filtering instead of GS 
to assign the current topic based on previous topic con- 
figurations. Despite of faster speed, none of the strategies 
perform as well as offline GS algorithms. On the other 
hand, online VB algorithms [14] are based on online 
stochastic optimization, which can converge to a local 
optimum of the offline VB objective function. However, 
they require manually setting several free parameters 
including the mini-batch size, which controls the best 
topic modeling performance based on heuristics. 

Parallel learning algorithms for LDA [15J, [16J dis- 
tribute large data sets across separate processors to 



accelerate the topic modeling process. For example, par- 
allel GS algorithms [15J approximate the asynchronous 
GS sampling process by synchronous updating of the 
global topic distributions. Parallel VB algorithms [16] 
use the Map-Reduce technique to scale the probabilistic 
topic modeling. While these parallel techniques often 
provide almost the same topic modeling performance 
as single-processor algorithms, they require expensive 
parallel hardware so that the performance-to-price ratio 
remains unchanged. Moreover, parallel algorithms of- 
ten consume additional resources to communicate and 
synchronize the topic modeling results from multiple 
processors. In practice, 1024 processors can achieve only 
around 700 times speedup corresponding to a low par- 
allel efficiency of approximately 0.7 1151 . Finally, it is 
often difficult to write and debug parallel algorithms on 
distributed systems. 

Current online and parallel techniques ignore han- 
dling the large number of topics K in massive corpora, 
which also increases the computational cost. The fast 
GS (FGS) algorithm [17J introduces the upper bound on 
the normalization factor using the Holder's inequality. 
Without visiting all possible K topics, FGS is able to 
draw the target topic sample from the approximate 
posterior probability normalized by the upper bound. 
By refining the upper bound to approximate the true 
normalization factor, FGS samples topic label equal to 
that drawn from the true posterior probability as GS 
does. In this sense, FGS yields exactly the same result 
as GS but with much less computation. 

3 Residual Belief Propagation 

The probabilistic topic modeling task can be interpreted 
as a labeling problem, in which the objective is to assign 
a set of semantic topic labels, z = {z^ d }, to explain 
the observed document-word co-occurrence matrix, x = 
{x w ,d}- The notations 1 < w < W and 1 < d < D are the 
word index in vocabulary and the document index in 
corpus. The notation 1 < k < K is the topic index. The 
nonzero element x w d ^ denotes the number of word 
counts at the index {w,d}. 

Fig. UK is the three-layer graphical representation of 
LDA [2J. The document-specific topic proportion 9d(k) 
generates a topic label, z l w d l e {0, 1}, E fc =i = L 
which in turn generates each observed word token n 
at the index {w,d} based on the topic-specific multino- 
mial distribution <fik (w) over the vocabulary words. Both 
multinomial parameters 9d{k) and 4>k(w) are generated 
by two Dirichlet distributions with hyperparameters a 
and f3, respectively. For simplicity, we consider only a 
smoothed LDA with fixed symmetric Dirichlet hyperpa- 
rameters a and /? (6J. The plates indicate replications. 
For example, there are K topics, the document repeats 
D times in the corpus, the word repeats times in the 
document d, and the vocabulary size is W. 

Fig. [Tj3 interprets Fig. [TJ\ by the two-layer factor 
graph |U, 10. The notation z^ d = YhZx ^w,d,J x w,d is 




(A) (B) (C) 

Fig. 1 . (A) Three-layer graph, (B) two-layer factor graph, and (C) message passing for LDA. 



the average topic labeling configuration over all word 
tokens 1 < i < x w ^ at the index {w,d}, where x w ,d is 
the number of word counts at the index {w, d}. Note that 
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recovers the original topic con- 



figuration over the word tokens in Fig. The factors 
Od and (f) w are denoted by squares, and their connecting 
variables z w .d are denoted by circles. The document- 
word co-occurrence matrix x = {x W:d } is denoted by 
shaded circles. The factor 9d connects topic labels z. i( j 
on different word indices within the same document d, 
while the factor <f> w connects topic labels z w> . on the same 
word index w but in different documents. We absorb 
the observed word index w as the index of <\> w , which 
is similar to absorbing the observed document d as the 
index of 6 d in Fig. QJ\- Notice that Fig. UK and Fig. [T^ 
have the same topology because both Od and 4> w have 
the same neighbors. 

Integrating out parameters {9, <fi} based on Dirichlet- 
Multinomial conjugacy yields the joint probability [18J 
of LDA in Fig. UK, 

D K r(T W x ,,z k +a) 
P(x,z\a,p) oc || H ^ ^ — x 

d=l k=l r[X)fe=l (J2w=l Xw , dZ w,d + a )] 
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where /(■) is the factor function, and z. )( j = {z Wi d, Z- Wi d} 
and z w> . = {z w ^ } z w ,-d} denote subsets of the variables 
in Fig. [lj>. Thus, the joint probability (TJJ of LDA can be 
re-written as the product of factor functions [4] of the 
factor graph in Fig. [l}3. Because Fig. OK and B encode 
the same joint probability, they represent the same LDA 



model but from different perspectives. 

Loopy BP [3 1 is an approximate inference method for 
factor graphs with loops in Fig. QJ}. Rather than directly 
computing the posterior probability p(z\x), we calculate 
the marginal probability p(z k d , x Wid \zt w - d i X-w-d), re- 



ferred to as message, \x{: 



w,d 



1) = fJ-w,d(k), which can 



be normalized efficiently using a local computation, i.e., 

J2k=i^A k ) = 1,0 < fJ. w ,d( k ) < 1- According to the 
Markov property and Fig. [TJ3, we obtain 



P{ Z w,di x w,d\ 
P( Z w,d' x M,d\ 



dW- 



ij t —dl x —w, — d) CX 

x -w,d)p( Z w,d' X ™,d\'Z'w,-d' x w,-d), (2) 



where —w and — d denote all word indices except 
w and all document indices except d, and the nota- 
tions z- w .d and z w .-d represent all possible neighbor- 
ing labeling configurations. From the message passing 
view, p(z^ ] ^x w ^ k - Wtd -. yL ~w,d) is the neighboring mes- 
sage ne d ^ tZw d (k) sent from the factor node Od, and 
P( z t,d\ z w ,-d' x ™ -d) is the other neighboring message 
AV™-^™,d(k) sent from the factor node (f> w . Eq. (|2) en- 
courages only K smooth topic configurations within the 
neighborhood system. 

Using the Bayes' rule and the joint probability Q), we 
expand ((2) as 



P(z k ,d,*;d) 
P( Z -w,d> x -w,d 

E- W x 



P( Z w,-' X 'UJ.- ) 



d z -w,d 



P{^ w ,-d^-d) 

a 



£ft=l(2C— u; x -w,d z - w ,d + a ) 
Y^tw=l(S-d Xw ,- dZ w,-d + P) 



(3) 



where we use the property, T(x + 1) = xT(x), to cancel 
the common terms in both nominator and denominator. 
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Eq. © updates the message of d if its neighboring 
topic configuration {z't w d , zj, _ d } is known. However, 
due to uncertainty, we know only the neighboring mes- 
sages rather than the precise topic configuration. So, we 
replace the topic configuration by the messages in Eq. l(3) 
and obtain the message update equation, 



where 



■ X ; 



Ekl»- W , d (k) + a] Ej/v-d(*0 + PX 



(4) 



fJ'-wJk) = ^2 x -w,dV>-w,d( k )> ( 5 ) 

—w 

,-d( k ) = ^2 x w ,-d(k)- (6) 



-a 



After updating, we need to normalize messages locally 
by the normalization factor Z, 



K 

fc=i 



Ak), 



(7) 



which requires K iterative computation. 

Messages are passed from variables to factors, and 
in turn from factors to variables until convergence or 
the maximum number of iterations. Notice that we 
need pass only messages for which x Wj d ^ 0. Because 
x is a very sparse matrix, Eq. ((Hi is computationally 
fast by sweeping only nonzero elements in the sparse 
matrix. Based on the inferred messages, we estimate 
the multinomial parameters 8 and </> by the expectation- 
maximization (EM) algorithm l!T8ll as follows, 



,(k) 



£>.,«!(*)+"]' 

*V-(fc) + fl 



(8) 
(9) 



The synchronous BP |3| updates all messages simulta- 
neously at iteration t based on the messages at previous 
iteration t — 1. Although in practice this schedule often 
converges, it uses more number of training iterations 
until convergence than VB. Notice that the fast conver- 
gence is a desired property for online learning [14] or 
distributed learning |16| for LDA. Therefore, we adopt 
RBP [10J for the rapid convergence based on the in- 
formed schedule of asynchronous message passing. 

The asynchronous schedule updates the message of 
each variable in a certain order, which is in turn used to 
update other neighboring messages immediately at each 
iteration t. The basic idea of RBP is to select the best 
update order based on the messages' residuals r w d(k), 
which are defined as the p-norm of difference between 
two message vectors at successive iterations. Here, we 
choose the L\ norm with p = 1, 



r w ,d(k) = x w4 \tf w d {h) - fS }(k)\, 



(10) 



where x w d is the number of word counts. If we se- 
quentially update messages in a descending order of 



fwd = ^2k r ™,d(k) at each iteration, the RBP algorithm 
theoretically converges faster or more often to a fixed 
point than synchronous BP [10|. Because RBP always 
updates messages with top largest residuals first, it 
efficiently accelerates the convergence than synchronous 
BP. Moreover, the updated messages with top largest 
residuals will in turn influence their neighboring mes- 
sages quickly, which accounts for the fast convergence 
property of RBP More details on theoretical and exper- 
imental analysis of RBP's convergence property can be 
found in (lOl . 

4 Active Belief Propagation 

In this section, we propose ABP within the RBP frame- 
work. We describe in detail how to actively select partial 
documents and topics at each iteration for fast topic 
modeling. We also provide an alternative implementa- 
tion of ABP. Finally, we discuss ABP's relationship to 
previous algorithms. 

4.1 The ABP Algorithm 

At each training iteration, ABP actively selects partial 
documents XdD from corpus for message updating and 
passing, where the parameter A^ 6 (0, 1] denotes the 
scanned corpus proportion. For each document, ABP 
searches partial topic space XkK for message updating 
and passing, where the parameter Afc e (0,1] denotes the 
proportion of the searched topic space. In theory, ABP 
costs only A^A^ training time of BP at each iteration. 

Based on i fTUt , we define the residuals at topics for 
specific documents as 

r d (k) =J2 r ™Ak), (11) 

W 

and accumulate all topic residuals at documents as 

r d = J2 r d(k). (12) 

k 

After each iteration, we sort rd(k) in a descending order 
for each document, and select the subset topics XkK 
with the largest residuals rv(fc). We also sort ra in a 
descending order for all documents, and select the subset 
documents XdD with the largest residuals rj. In the 
following iteration, we update and pass only messages 
for the subset documents XdD and the subset topics XkK, 
and thus save enormous training time in each loop. 

According to the selected XkK topics, we need to 
normalize local messages. At the first iteration t = 1, 
ABP performs the same as the conventional BP, which 
updates and normalizes all messages for all topics. In 
the later iterations, based on residuals, ABP actively 
selects the subset XkK topics for message updating and 
passing. We normalize local messages for the selected 
topics k € XkK by 



Vi.d(k) 
Ek^lJk) 



J2^d(k),keX k K, (13) 
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A*ui,d4(2) 



Fig. 2. Active message passing based on the sorted residuals. 



where is the normalized message in the previous 
iteration, jjL l w d (k) is the normalized message in the cur- 
rent iteration, and $ w Jk) is the unnormalized message 
updated according to (4). In this way, we require only 
XkK iterations for the local message normalization to 
avoid calculating the normalization factor Z in 10 with 
K computation. 

At a negligible computational cost, the sorted residuals 
can be refined during message passing process. The 
computational cost of initial sorting i fTTI l and | [T2|| using 
the standard "quick sort" algorithm are 0(K log K) and 
0(D log D), respectively If the successive residuals are 
in almost sorted order, only a few swaps will restore the 
sorted order by the standard "insertion sort" algorithm, 
thereby saving lots of sorting time. 

Fig. |2] shows an example of active message passing. 
The solid and dashed segments denote normalized mes- 
sages in proportion. The solid segments and arrows 
denote those selected messages for updating and pass- 
ing, while the dashed segments denote those unchanged 
messages. Suppose that documents dl, d2, c?3 and dA 
have top largest residuals rai, r^, and r^. We select 
these four documents in the subset XdD for active mes- 
sage passing. For the document dl, only r w di(l) is in the 
subset XkK so that the message fJ, w ,diO-) is updated and 
passed. For the document d2, three residuals 7^,^2(4), 
?Vd2(3) and r w ,d2{l) are ranked in the subset XkK, 
and thus the messages /W2(4), Mu>,d2(3) and Hw,di{l) 
are updated and passed. Similar rules are applied to 
documents d3 and dA. Because only partial messages are 
updated and passed, ABP consumes significantly less 
computation than BP in each loop. 

The residuals are dynamically refined and sorted at 
each iteration. Due to convergence, the top residuals 
in the previous iterations may be ranked lower in the 
following iterations. As a result, those unchanged mes- 
sages still have the chance to be updated and passed. For 



example in Fig. [2J those dashed segments still have the 
chance to become solid segments in later iterations. In 
this sense, ABP remains the same message information 
as BP. Given the fixed training iterations T, ABP scans 
specific partial documents and searches specific partial 
topics with more iterations than others. In contrast, the 
BP algorithm scans all documents and searches all topics 
with equal number of iterations. 

Fig. [3] summarizes the ABP algorithm. When t = 1, 
ABP scans the entire corpus and searches the complete 
topic space, and in the meanwhile computes and sorts 
residuals by the "quick sort" algorithm. For 2 < t < T, 
ABP actively selects the subset documents XdD and the 
subset topics XkK for message updating and passing. 
After each iteration, ABP dynamically refines and sorts 
residuals by the "insertion sort" algorithm. The ABP 
algorithm terminates when T is reached or convergence 
condition is satisfied. In practice, ABP costs more than 
XdXk training time of BP at each iteration due to addi- 
tional time for refining and sorting residuals. 



4.2 An Alternative Implementation for ABP 

For massive corpora, initially sorting r<j in ((12)1 requires 
at most a computational complexity of OiD log D). This 
cost is very high when D is very large, for example, 
D = 10 8 . Alternatively, we may define residuals at the 
fixed vocabulary, 



and 



>(k) = y"V„, |d (fc), 



(14) 



(15) 



We may sort r w with a significantly less computational 
cost of ©(VFlogVF) because W is often a fixed value 
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input : x,K,T,a,P, X d , X k . 
output : 9 d ,(j) w . 

M° d( k ) *~ random initialization and normalization; 
for d <- 1 to D do 
for fc <- 1 to if do 

1 ^ M°-„, d (fc)+<» M° ,(fc)+ff . 

MLd(fc) «- normalize(/4 d (fe)); 

end 

Afcif <— quick sort(rj(fc), 'descend'); 
end 

A d Z) <- quick sort(r^, 'descend'); 
for i <- 2 to T do 

for d G A ( ;_D do 
for k e A fc A' do 

t cm „ Mi~J t(1 (fc)-t-a 



i(fc)+/3 



<- n ormalize(^ id (A;)); 

r w,dQ t ) *~ x w,d\fl> w dW 



(fc)l; 



end 



A fc 7f <— insertion sort(r^(fc), 'descend'); 

end 

A d _D <— insertion sort(r^, 'descend'); 
end 

«j(^Wt) + al/E») + 4 



Fig. 3. The ABP algorithm. 

when D is a huge number for a massive copora. In 
this case, ABP actively selects partial vocabulary words, 
X W W, Am 6 (0,1], for topic modeling at each iteration. 
Due to page limitation of this paper, we do not show 
the experimental results of this alternative. Interested 
readers may find source codes of this implementation 
in E9). 

4.3 Relationship to Previous Algorithms 

The proposed ABP algorithm adopts a simple idea to 
accelerate the topic modeling process based on residu- 
als. At each iteration, it selects only a mini-batch from 
the entire corpus for message updating and passing. 
From this perspective, ABP resembles the online learning 
algorithms fT3]| , [14] but with two main distinctions. 
First, online algorithms randomly read each mini-batch 
from a data stream without selection. Second, online 
algorithms use each mini-batch only once, which may 
cause a lower topic modeling accuracy [13]. Technically, 
ABP can converge to the local maximum of the objective 
function as BP, while online algorithms require specific 
parameter settings to achieve this goal fl4l . 

Unlike the parallel algorithms [ll), CE) for LDA, ABP 
uses a single-processor to perform topic modeling with- 
out complicated parallel architectures. However, due to 
memory limitation, it is hard to fit massive corpora with 
large number of topics into the memory of a common PC 



TABLE 1 

Statistics of four document data sets. 



Data sets 


D 


W 


N d 


w d 


NIPS 


1500 


12419 


311.8 


217.6 


ENRON 


39861 


28102 


160.9 


93.1 


NYTIMES 


15000 


84258 


328.7 


230.2 


PUBMED 


80000 


76878 


68.4 


46.7 



(for example, 3G RAM). We have two straightforward 
strategies to solve this problem. First, we may extend 
ABP to online and parallel learning. Online learning 
requires a fixed memory for each mini-batch data, while 
parallel learning has the large memory space based on 
parallel architectures. Second, we may adopt the block 
optimization framework [20], which sequentially reads 
blocks of data from the hard disk into memory. 

The FGS algorithm fl7l is an important improvement 
over the conventional GS algorithm when K is very 
large. The basic idea is to combine both message update 
and sampling process together by introducing an upper 
bound Z for the normalization factor Z. FGS further 
refines the upper bound to approximate the true nor- 
malization factor, so that the sampled topic label equals 
that drawn from the true posterior probability. Similar 
to FGS, we may define an upper bound and pass only 
top largest segments fi w> d(k)/Z in the BP algorithm to 
reduce the computation. However, this strategy does not 
work because we will not be able to update and pass 
those short segments in Fig. [2] leading to serious loss of 
information. While such a strategy is faster, its accuracy 
is much lower than the conventional BP algorithm. To 
avoid information loss, ABP is based on the RBP frame- 
work by dynamically refining and sorting residuals, 
which determines the subsect XkK topic space at each 
iteration. Through the dynamical residual sorting, ABP 
has the chance to scan all documents and search topic 
space without information loss. 

5 Experiments 

Our experiments aim to verify the accelerating effects of 
ABP compared with VB 0, GS GO, FGS G2, and BP 
algorithms. We use four publicly available document 
data sets [17]: NIPS, ENRON, NYTIMES and PUBMED. 
Previous studies [17] revealed that the speedup effect is 
relatively insensitive to the total number of documents 
in the corpus. Because of the memory limitation, we 
randomly select 15000 documents from the original NY- 
TIMES data set, and 80000 documents from the original 
PUBMED data set for experiments. Table [TJ summarizes 
the statistics of four data sets, where D is the total 
number of documents in the corpus, W is the number 
of words in the vocabulary, N d is the average number 
of word tokens per document, and W d is the average 
number of word indices per document. 

We randomly partition each data set into halves with 
one for training set and the other for test set. We calculate 



7 



the training perplexity |5| on the training set after 500 
iterations as follows, 

V = cxpi '■ — S . (16) 

Usually, the training perplexity will decrease with the 
increase of number of training iterations. The algorithm 
often converges if the change of training perplexity at 
successive iterations is less than a predefined threshold. 
In our experiments, we set the threshold as one because 
the decrease of training perplexity is very small after 
satisfying this threshold. 

The predictive perplexity for the unseen test set is 
computed as follows [5 J. On the training set, we esti- 
mate </> from the same random initialization after 500 
iterations. For the test set, we randomly partition each 
document into 80% and 20% subsets. Fixing <f>, we esti- 
mate 8 on the 80% subset by training algorithms from 
the same random initialization after 500 iterations, and 
then calculate the predictive perplexity on the rest 20% 
subset, 



V = exp ■ 



,.20% 



(17) 



where x^°d denotes word counts in the the 20% sub- 
set. The lower predictive perplexity represents a better 
generalization ability. 

For all data sets, we fix the same hyperparameters 
as a = 2/K,P = 0.01 [Y7\. The CPU time per iter- 
ation is measured after sweeping the entire data set. 
We report the average CPU time per iteration after 
T = 500 iterations, which practically ensures that GS and 
FGS converges in terms of training perplexity. For fair 
comparison, we use the same random initialization to ex- 
amine all algorithms with 500 iterations. We implement 
all algorithms based on the MEX C++ /MATLAB/ Octave 
platform, where GS and FGS C++ source codes are 
the same as FastLDA [17J. To repeat our experiments, 
we have made all source codes and data sets publicly 
available (191 . All experiments are run on the Sun Fire 
X4270 M2 server with two 6-core 3.46 GHz CPUs and 
128 GB RAMs. 

5.1 Parameters X d and A fc 

First, we examine two parameters A^ and Afc in ABP on 
the relatively smaller NIPS data set. We are wondering 
how these two parameters would influence the topic 
modeling accuracy of ABP. The parameter A^ G (0, 1] 
controls the proportion of documents to be scanned, 
and the parameter Afc £ (0, 1] defines the proportion 
of topics to be searched at each iteration. The smaller 
values correspond to the faster speed of ABP. We choose 
the training perplexity of ABP with 500 iterations when 
Xd = At = 1 as the benchmark. The relative training 
perplexity is the difference between this benchmark and 
the ABP's training perplexity with 500 iterations at other 
parameter values. 



Fixing Afc = 1, we change Xd from 0.1 to 0.5 with the 
step 0.1. Fig. H] shows the relative training perplexity 
as a function of X d when K = {100,300,500,700,900}. 
We see that the relative perplexity decreases when A^ 
increases, which confirms the fact that scanning more 
documents at each iteration will yield the lower training 
perplexity under different topics K. Notice that the 
relative perplexity is larger when K is smaller, while it is 
smaller when K is larger. This phenomenon shows that 
when K is very large, even scanning a small portion 
of documents is enough to provide a comparable topic 
modeling accuracy. When Xd > 0.2 we also see that 
the relative perplexity is less than 20, which in practice 
is a negligible difference on a common document data 
set. Therefore, we think that A^ = 0.2 is a safe bound 
to guarantee a good topic modeling accuracy with a 
relatively faster speed. Although smaller Xd < 0.2 will 
produce faster speed, it will cause an obvious degrade 
of topic modeling accuracy. 

Fixing Xd = 1, we also change Afc from 0.1 to 0.5 with 
the step 0.1. Fig. [5] shows the relative training perplexity 
as a function of A fc when K e {100,300,500,700,900}. 
Surprisingly, there is no big difference when Afc = 0.1 
and Afc = 0.5 especially when K = 700. This phe- 
nomenon implies that only a small proportion of topics 
plays the major role when K is very large. When Afc < 
0.5, ABP achieves even a lower perplexity value than 
ABP with Afc = 1. The reason is that most documents 
have very sparse topic proportions when K is very 
large, and thus searching partial topic space is enough to 
yield a comparable topic modeling accuracy. We wonder 
whether Afc can be even smaller when K is very large, 
e.g., K e {1500,2000}. On NIPS data set, ABP with 
Afc = 0.05 achieves 555.89 and 542.70 training perplexity, 
respectively. In contrast, ABP with Afc = 1 achieves 
543.90 and 533.97 training perplexity, respectively. The 
relative training perplexity is less than 2%. Therefore, 
we speculate that when K is very large like K > 2000, 
X k K may be a constant, e.g., XkK = 100. In this case, the 
training time of ABP is independent of K. This bound 
XkK = 100 is reasonable because usually a common 
word is unlikely to be associated with more than 100 
topics. 

Users may set different parameters Xd and Afc for dif- 
ferent speedup effects. To pursue the maximum speedup, 
we choose A^ = Afc =0.1 referred to as ABP1 in the 
rest of our experiments. To ensure the topic modeling 
accuracy, we also choose the safe bound Xd = Afc = 0.2 
referred to as ABP2. Obviously, ABP1 is much faster 
than ABP2 but with relatively lower accuracy in terms 
of training perplexity. 

5.2 Predictive Perplexity 

The predictive perplexity is a widely-used performance 
measure for the generalization ability |2J, especially for 
different training algorithms of LDA. Fig. [6] compares 
the predictive perplexity as a function of topics K = 
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Fig. 4. Relative training perplexity as a function of the parameter when K = {100, 300, 500, 700, 900}. 
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Fig. 6. Predictive perplexity as a function of topics K = {100,300,500,700,900} on NIPS, ENRON, NYTIMES and 
PUBMED data sets. The notation 0.8a; denotes the predictive perplexity is multiplied by 0.8. 



{100,300,500,700,900} by VB, GS/FGS, BP, ABP1 and 
ABP2 algorithms. FGS and GS have exactly the same 
topic modeling accuracy, so that their predictive perplex- 
ity curves overlap denoted by blue squares. 

On ENRON, NYTIMES and PUBMED data sets, VB 
shows an obvious overfitting phenomenon, where the 
predictive perplexity increases with the number of topics 
K. Also, VB often yields the highest predictive per- 
plexity. For a better visualization, we multiply VB's 
predictive perplexity value by 0.8 and 0.4 on ENRON, 
NYTIMES and PUBMED data sets, respectively. There- 
fore, under a large number of topics on large data sets, 
VB often generalizes badly to predict unseen test set. The 
major reason is that VB aims to optimize an approximate 
variational distribution with a gap from the objective 
joint distribution. Experiments show that this gap or 
bias is obvious in case of the massive data set with the 
large number of topics. Although VB may choose the 
proper hyperparameters to correct this bias |5], it is still 
unreliable for inferring thousands of topics from massive 
data sets. 

On all data sets, BP consistently performs the best with 
the lowest predictive perplexity. Unlike VB, BP decreases 



the predictive perplexity with the increase of the number 
of topics K without overfitting. Theoretically, there is 
no gap because BP directly infers the marginal poste- 
rior probability called message from the objective joint 
probability. So, BP always yields the lowest predictive 
perplexity with the best generalization ability in practice. 

With the parameters X d = X k = 0.2, ABP2 per- 
forms almost the same as BP, especially on ENRON and 
PUBMED data sets. This result shows that scanning 20% 
documents and searching 20% topics at each iteration is 
enough to provide a comparable topic modeling accu- 
racy as BP With the parameters X d = A fc = 0.1, ABP1 
provides a much higher predictive perplexity than BP 
On the relatively smaller data set NIPS, ABP1 performs 
even worse than VB when K = 100. However, on the 
relatively large data set PUBMED and when K is large, 
e.g., K = 900, ABP1 shows a comparable predictive 
perplexity as BP. When compared with GS/FGS, ABP1 
is worse under different topics on the relatively small 
NIPS set, but is much better on the large data sets such 
as ENRON, NYTIMES and PUBMED when K is large. 

In conclusion, ABP2 often yields a comparable perfor- 
mance as BP, and outperforms GS/FGS and VB on all 
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data sets. While ABP1 has a higher predictive perplexity 
than ABP2 and BP on all data sets, it often outperforms 
GS/FGS and VB for larger data sets with larger number 
of topics. 

5.3 Speedup Effects 

Fig. [7] shows the CUP time (seconds) per iteration as a 
function of K for VB, GS, FGS, BP, ABP1 and ABP2. 
The training time of all algorithms increases linearly 
with K except FGS on the PUBMED data set. When 
K is very large, FGS [17] will locate the targeted topics 
more efficiently so that its training time does not increase 
linearly with K. Because FGS does not visit all possible 
topic space to sample the targeted topic, it is much faster 
than GS when K is very large. Notice that FGS still needs 
to scan the entire corpus to sample topics for each word 
token. VB consumes the longest training time because of 
its complicated digamma function calculation [3]. For a 
better illustration, we multiply VB's training time by 0.2. 
This result confirms that VB is unsuitable for fast topic 
modeling of massive data sets. 

The training time of BP is close to that of GS when 
K is small, but is longer than that of GS when K is 
large on ENRON, NYTIMES and PUBMED data sets. 
The major reason is that BP requires visiting all topic 
space, while GS stops visiting the rest topic space until 
sampling the target topic. When Ad = Xk = 0.1, ABP1 in 
theory requires only 1/100 training time of BP. However, 
ABP1 needs to sort and update residuals so that it 
on average consumes around 1/30 training time of BP. 
When Xd = Xk — 0.2, ABP2 theoretically requires four 
times longer training time than ABP1. But in practice 
ABP1 runs around three times faster than ABP2. The 
speedup effects are significant. In conclusion, Fig. 
demonstrates that both ABP1 and ABP2 are at present 
the fastest algorithms for training LDA. 

Fig. [8] shows the ratio of CPU time per iteration over 
ABP1 as the benchmark. For a better illustration, we 
multiply the ratio of VB by 0.2. We see that ABP1 runs at 
least 100 times faster than VB on NIPS, NYTIMES and 
PUBMED data sets. On ENRON set, we also see that 
ABP1 is close to 100 times faster than VB. Such a signif- 
icant speedup facilitates fast topic modeling of massive 
data sets. For example, if VB uses three months for train- 
ing LDA, ABP1 consumes only one day for the same task 
with much better generalization ability shown in Fig. [6] 
ABP1 also runs around 20 to 40 times faster than both GS 
and BP. When compared with relatively faster FGS, ABP1 
still runs around 10 to 20 times faster with comparable 
predictive perplexity. ABP2 consistently consumes 2 to 
3 times more training time than ABP1. Both Figs. [6] 
and [8] reconfirm that the proposed ABP algorithm can be 
around 10 to 100 times faster than the current state-of- 
the-art algorithms, while retains a comparable accuracy 
in topic modeling of massive data sets. Therefore, we 
advocate ABP1 for fast topic modeling of the massive 
data set when D and K are large. 



5.4 Convergence 

Fig. [9] shows the training perplexity as a function of 
number of training iterations when K = 500. We see 
that GS and FGS converge at the same speed (overlapped 
curves) but converge much slower than both ABP2 and 
BP. Usually, GS and FGS uses 400 - 500 iterations to 
achieve convergence. Among all algorithms, VB con- 
verges fastest, which often uses around 80 ~ 150 itera- 
tions for convergence. Although VB sometimes achieves 
the lowest training perplexity on NIPS and NYTIMES 
data sets, it yields the highest predictive perplexity 
on unseen test set by overfitting. We see that ABP2 
converges slightly slower than BP, which is caused by 
scanning partial corpus and searching partial topic space 
at each iteration. In our experiments, BP uses around 
100 - 200 iterations, while ABP2 uses around 300 - 500 
iterations to achieve convergence. On NYTIMES and 
PUBMED data sets, ABP2 converges with almost the 
same speed as GS and FGS algorithms. Furthermore, 
ABP1 converges significantly slower than ABP2 on all 
data sets. It often uses 500 ~ 800 iterations for conver- 
gence. While at each iteration ABP1 scans 10% corpus 
and searches 10% topics for fast speed, it significantly 
slows down the convergence speed than BP. However, 
even if we compare the training time until convergence, 
ABP1 is still the fastest algorithm. Fig. [10] summarizes 
the training time ratio until convergence over ABP1. We 
see that ABP1 still runs around 20 ~ 40 times faster than 
VB, 10 ~ 20 times faster than GS, and 5 ~ 10 times faster 
than both FGS and BP. 

5.5 Topic Visualization 

Fig. [IT] shows the top ten words of ten topics extracted 
by VB (red), GS/FGS (blue), BP (black) and ABP (green) 
algorithms on the NIPS training set. For ABP, we choose 
Xd = Afc = 0.5 because K = 10 is small. We see that 
all algorithms can infer semantically meaningful topics, 
where most top ten words are consistent except the 
slightly different word ranking. There are two subjective 
measures for the interpretability of discovered topics: 
word intrusion and topic intrusion (2l|. The former is 
the number of word intruders in each topic, while the 
latter is the number of topic intruders in each document. 
According to our experience, the word and topic intru- 
sions are comparable among all algorithms. Given the 
similar interpretable topics, we advocate ABP for fast 
topic modeling of massive data sets. 

6 Conclusions 

Fast training algorithms for LDA has attracted intensive 
interests recently. For fast topic modeling the real-world 
massive corpora with large number of topics, the online 
and parallel learning algorithms are two straightforward 
strategies. The proposed ABP algorithm provides the 
third alternative for fast topic modeling, which may also 
inspire more efficient online and parallel algorithms. Ex- 
tensive experiments on large data sets demonstrate that 
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PUBMED data sets. 



ABP accelerates the topic modeling process significantly, 
and in the meanwhile remains a comparable topic mod- 
eling accuracy. As with the conventional BP algorithm, 
ABP scales linearly with the number of documents D 
and the number of topics K but with a fraction Xd^k *C 1. 

In the various test sets, we find that at each iteration 
only 20% documents and 20% topics are enough to 
yield almost the same topic modeling performance as 
BP, which seems to follow the Pareto principle or 80- 
20 rultQ, stating that roughly 80% of the effects come 
from 20% of the causes. Furthermore, we find that ABP 

1. http://en.wikipedia.org/wiki/Pareto_principle 



can choose even much smaller parameters and 
under the large D and K. This result implies that not all 
documents and topics contribute equally for the topic 
modeling performance of massive corpora with large 
number of topics. For example, at each iteration only 10% 
documents and 10% topics can provide comparable or 
much better performance than GS/FGS algorithms when 
K = 900. So, we speculate that the subsets XdD and XkK 
can be constants for massive data sets (D > 10 6 ) with 
large number of topics (K > 1000). In this way, ABP may 
use a fixed training time for probabilistic topic modeling 
of these massive data sets. 
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Fig. 10. Ratio of training time until convergence over ABP1 as a function of topics K = {100,300,500,700,900} on 
NIPS, ENRON, NYTIMES and PUBMED data sets. 
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Fig. 11. Top ten words of ten topics on NIPS training set: VB (red), GS/FGS (blue), BP (black) and ABP (green). 
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Because of its ease of use and fast speed, ABP is a 
strong candidate for becoming the standard LDA train- 
ing algorithm. Future work includes the relaxation of 
memory requirements for massive corpora. We may ex- 
tend ABP within the block optimization framework |20|, 
reading data as blocks from hard disk into memory. With 
fast speed and little memory we may realize practical 
topic modeling softwares for massive personal data in- 
cluding documents, photos, videos on a common PC in 
the near future. 
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